Data preparation

Reading in the data

dat <- read.xlsx("Hybridmobilität systemized.xlsx", sheet = "Labels", startRow = 2)

We’ve renamed each variable name to a shorter one, and marked variables which are not necessary with “X_” as a prefix. As such, we can now filter all the variables which has this prefix.

dat <- dat %>% 
   select(-starts_with("X_"))

Further, we’ve named each “importance” metric with “IMP”, and each “satisfaction” metric with “SAT”. As each metric gets asked twice (one for importance and one for satisfaction), we’ve labeled them with the same number after the IMP or SAT prefix (e.g. SAT_1 and IMP_1 correspond to the same metric). We can filter for them as well with their “ID” columns so we can combine them again at a later stage.

IMP_SAT <- dat %>% select("ID",
                          starts_with("SAT"),
                          starts_with("IMP"))

Additionally, we can filter for demographic variables, which are stored in the first 52 columns.

demographics <- dat %>% select("ID",
                               1:52)

Lastly, we can filter for the Profiling Questions, which might add additional information to the participant.

PR_ratings <- dat %>% select("ID",
                             starts_with("PR1"))

Cleaning the data

Now that we have 3 different datasets, we can start with datacleaning each of these datasets sepparately. We’ll start with the importance and satisfaction scores (which we will then use to calculate the “pain” points).

Cleaning Importance and Satisfaction Scores

As values are stored as “[number] = [interpretation]” in these metrics, and we only need the number, we extract the first character in these metrics and turn them into class numeric.

IMP_SAT <- IMP_SAT %>% 
   mutate_at(.vars = vars(starts_with("SAT")),
             .funs = ~str_extract(string= ., pattern = "^.")) %>% 
   mutate_at(.vars = vars(starts_with("IMP")),
             .funs = ~str_extract(string= ., pattern = "^.")) %>% 
   mutate_if(.predicate = is.character,
             .funs = as.numeric)

We can now calculate “paint point” scores for each of the 106 metrics, and to make this more efficient, we’ll use pivot longer to have 3 columns: the first identifies whether it was a importance or a satisfaction rating, the second which metric number it was (1-106), and the values are the ratings themselves.

IMP_SAT_long <- IMP_SAT %>% 
   pivot_longer(cols = starts_with("IMP")| starts_with("SAT"),
                names_to = c("IMP/SAT","Metric_Nr"),
                values_to = "Rating",
                names_sep = "_")
IMP_SAT_long %>% head()
## # A tibble: 6 × 4
##      ID `IMP/SAT` Metric_Nr Rating
##   <dbl> <chr>     <chr>      <dbl>
## 1     1 IMP       1              4
## 2     1 IMP       2              4
## 3     1 IMP       3              4
## 4     1 IMP       4              3
## 5     1 IMP       5              4
## 6     1 IMP       6              4

We’ll have to pivot this dataframe to a wider format (so we have IMP and SAT ratings in sepparate columns) and calculate the differences for each metric and costumer, which functions as our pain point for each costumer.

IMP_SAT_pain_points <- IMP_SAT_long %>% 
   pivot_wider(names_from = `IMP/SAT`,
               values_from = Rating) %>% 
   mutate(Diff = IMP-SAT)

IMP_SAT_pain_points %>% head()
## # A tibble: 6 × 5
##      ID Metric_Nr   IMP   SAT  Diff
##   <dbl> <chr>     <dbl> <dbl> <dbl>
## 1     1 1             4     4     0
## 2     1 2             4     4     0
## 3     1 3             4     4     0
## 4     1 4             3     3     0
## 5     1 5             4     4     0
## 6     1 6             4     3     1

Lastly, we can pivot this dataframe to a wider format again (so each person who filled out the survey has its own row again).

Pain_points <- IMP_SAT_pain_points %>% 
   select(ID, Metric_Nr, Diff) %>% 
   mutate(Metric_Nr = as.numeric(Metric_Nr)) %>% 
   arrange(ID, Metric_Nr) %>% 
   pivot_wider(names_from = Metric_Nr,
               values_from = `Diff`)
   
Pain_points[1:10, 1:10]
## # A tibble: 10 × 10
##       ID   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1     0     0     0     0     0     1     0     1    -1
##  2     2    -2    -2    -1    -1    -2    -2     2     0    -1
##  3     3     0     1     0     1     0    -1     1     1     1
##  4     4    -1    -1     1     0     0    -2    -2     0    -2
##  5     5    -1     0    -2    -1    -1    -2    -1    -1    -1
##  6     6     0     1     2     1     1     1     0     0     1
##  7     7    -1    -1    -1     0     0    -2     2     2     1
##  8     8     0     1     0     0     0     1    -1     2     1
##  9     9     0     1     0     0    -2     0     1    -1     0
## 10    10    -2    -3    -2    -2    -1    -2    -3    -2    -2

Cleaning Demographics

In this section, we will clean the demographics.
This is how our current demographic dataframe looks like:

vis_dat(demographics, sort_type = FALSE)

First, we create a function which can detect dichotomous variables (e.g. cells contain “Ausgewählt” or “Nicht ausgewählt”). We store the column number in a vector as a means to recode them as numeric variables (0 or 1) later on.

is_yes_no_question <- function(x) {
   vec <- c()
   counter <- 1
   
   for (i in 1:ncol(x)) {
      if (x[1, i] == "Ausgewählt" |
          x[1, i] == "Nicht ausgewählt") {
         vec[counter] <-  i
         counter <- counter + 1
      }
   }
   return(vec)
}

yes_no_cols <- is_yes_no_question(demographics)
yes_no_cols
##  [1]  9 10 11 12 13 14 15 16 17 18 19 20 40 41 42 43 44 45 46 47 48

We then recode said variables to 1 and 0 respectively.

demographics <- demographics %>% 
   rowwise() %>% 
   mutate_at(.vars = yes_no_cols, 
             .funs = ~if_else(. == "Ausgewählt", 1, 0))

We’ll then turn every character column as a factor.

demographics <- demographics %>% mutate_if(is.character, factor)

Then we order the factors according to their proper order if it is ordinal scaled. This is a tedious step, as we have to look at each variable…

#levels: from smallest to biggest

demographics$transportation_pro_woche <-
   factor(demographics$transportation_pro_woche,
          levels = c("Mehrmals in der Woche",
                     "Täglich"))
#still unsure:
#demographics$prozent_berufstätig <- factor(demographics$prozent_berufstätig)

demographics$nutzung_oev <-
   factor(
      demographics$nutzung_oev,
      levels = c(
         "Einmal im Monat oder seltener",
         "Mehrmals im Monat",
         "Einmal in der Woche",
         "Mehrmals in der Woche",
         "Täglich"
      )
   )

demographics <- demographics %>%
   mutate_at(.vars = vars(starts_with("aw_")),
             .funs = ~ factor(
                . ,
                levels = c(
                   "Nie",
                   "Seltener",
                   "Mehrmals im Monat",
                   "Mehrmals pro Woche",
                   "(fast) Täglich"
                )
             ))

demographics <- demographics %>%
   mutate_at(.vars = vars(starts_with("fz_")),
             .funs = ~ factor(
                . ,
                levels = c(
                   "Nie",
                   "Seltener",
                   "Mehrmals im Monat",
                   "Mehrmals pro Woche",
                   "(fast) Täglich"
                )
             ))

demographics$neuen_ort <-
   factor(
      demographics$neuen_ort,
      levels = c(
         "Nie",
         "1 mal im Monat oder seltener",
         "2-3 mal im Monat",
         "4 mal im Monat oder häufiger"
      )
   )

demographics$privates_auto <-
   factor(
      demographics$privates_auto,
      levels = c(
         "Nein, ich habe keinen Zugang zu einem privaten Auto",
         "Ja, aber es gehört Bekannten/Freunden oder jemand aus der Familie in einem anderen Haushalt",
         "Ja, aber es gehört jemand im selben Haushalt",
         "Ja, und es gehört mir selbst"
      )
   )

demographics$ausflug_oev <-
   factor(
      demographics$ausflug_oev,
      levels = c(
         "Nie",
         "Seltener",
         "2-3 mal im Monat",
         "1 mal in der Woche oder häufiger"
      )
   )

demographics$störungen_oev <-
   factor(
      demographics$störungen_oev,
      levels = c(
         "Nie",
         "1-2 mal im letzten Monat",
         "3-5 mal im letzten Monat",
         "6 mal oder häufiger im letzten Monat"
      )
   )

demographics$störungen_strasse <-
   factor(
      demographics$störungen_strasse,
      levels = c(
         "Nie",
         "1-2 mal im letzten Monat",
         "3-5 mal im letzten Monat",
         "6 mal oder häufiger im letzten Monat"
      )
   )

demographics$technologische_entwicklung <-
   factor(
      demographics$technologische_entwicklung,
      levels = c(
         "Technologische Entwicklungen interessieren mich nicht wirklich, ausser sie bieten mir einen konkreten Nutzen im Alltag",
         "Ich warte meist ab, bis mir andere von ihren Erfahrungen erzählen, bevor ich neue Technologien ausprobiere",
         "Ich probiere gerne Neues aus, verlasse mich dabei aber auf Produkte von bekannten und etablierten Marken",
         "Ich bin oft einer/eine der ersten in meinem Umfeld, der/die neue Technologien, Apps oder Gadgets ausprobiert"
      )
   )

Cleaning Profiling Questions

The last datacleaning step would be to clean the PR_ratings dataset. Luckily, the data are stored as “[number] = [meaning]” again, so we can just extract the first character (in this case always the number) and turn them to a numeric variable again, like we did in the IMP_SAT dataset.

PR_ratings <- PR_ratings %>%
   mutate_at(
      .vars = vars(starts_with("PR")),
      .funs = ~ str_extract(string = ., pattern = "^.")
   ) %>%
   mutate_if(.predicate = is.character,
             .funs = as.numeric) %>% 
   mutate(ID = as.numeric(ID))

PR_ratings[1:6,1:6]
##   ID PR12_1 PR12_2 PR12_3 PR12_4 PR12_5
## 1  1      4      3      4      3      3
## 2  2      1      2      3      2      3
## 3  3      1      1      1      1      4
## 4  4      3      3      2      2      4
## 5  5      1      2      5      2      2
## 6  6      2      4      3      3      4

We can then combine the demographics dataset with the PR_ratings dataset, as both contains useful information from the customer.

demographics_full <- demographics %>% 
   mutate(ID = as.numeric(ID)) %>% 
   full_join(PR_ratings, by = "ID")

This is how the dataframe looks like:

vis_dat(demographics_full, sort_type = F)

As we can see, the factor variables are still not numeric (which we need for the clustering method). As we’ve ordered them according to their levels in the previous steps, we can now turn them into numeric variables as well.

demographics_full <- demographics_full %>% mutate_if(is.factor, as.numeric)

vis_dat(demographics_full, sort_type = F)

Analysis

After cleaning the data, we can now start to analyse the data. The analysis is split between two pipelines. In the first section, we tried to cluster our costumers and see whether there are “archetypes” for which we can then base our future implementations.

In the second section, we tried to predict whether the top pain points can be predicted using only demographic data.

Segmentation of Customers (First Analysis Pipeline)

We have to standardize the whole dataframe first (except the ID variables) because the scales were not the same for each variable.

demog_scaled <- demographics_full %>% 
   select(-ID) %>% 
   scale(scale = TRUE)

As we still have some missing values in some cases, we’ll try and impute them according to the variables median. Are there other/ better ways?

repl_na_with_median <- function(x){
   mode_val <- median(x, na.rm = TRUE)
   x[is.na(x)] <- mode_val
   return(x)
}


vec_miss <- c()
for (i in 1:ncol(demog_scaled)){
   if (sum(is.na(demog_scaled[,i]) > 0)){
      vec_miss <- c(vec_miss, i)
   }
}


demog_scaled_imputed <- demog_scaled %>% 
   as.data.frame() %>% 
   mutate_at(.vars = vec_miss, .funs = repl_na_with_median) %>% 
   as.matrix()

Now we can try and cluster the demographics data. We’ll try and “find” the best cluster size with a for loop, which then calculates the F-statistic for each cluster. The F-statistic is a ratio of between cluster sum of squares and within cluster sum of squares. In other words, we get a numeric value where we have a good fit between between and within cluster distances. We’ll take the cluster number which has the biggest value. Is this a good approach?

set.seed(42)
Fstat <- c()
for (i in seq(from = 2, to = 10, by = 1)){
   k <- i
   
   km_fit <- kmeans(demog_scaled_imputed, centers = k)
   
   N <- nrow(Pain_points)
   
   Fstat[i-1] <- (km_fit$betweenss / (k-1)) / (km_fit$tot.withinss / (N-k))
}

plot(seq(from = 2, to = 10, by = 1), y = Fstat)

In this case, the F-value is largest with 2 clusters.

We’ll create a new variable group with the clusters.

km_fit <- kmeans(demog_scaled_imputed, centers = 2)

demographics_full$group <- km_fit$cluster

Let’s try and visualize mean Importance and Satisfaction scores according to the two different clusters!

IMP_SAT_pain_points <- demographics_full %>%
   select(ID, group) %>% 
   full_join(IMP_SAT_pain_points, by = "ID")
   
IMP_SAT_pain_points %>% 
   group_by(group,Metric_Nr) %>% 
   summarize(mean_IMP = mean(IMP),
             mean_SAT = mean(SAT)) %>% 
   ggplot(aes(mean_IMP, mean_SAT))+
   geom_label(aes(label = Metric_Nr),
              label.padding = unit(.05, "lines"),
              size = 3)+
   facet_wrap(~group)+
   theme_bw()

Predicting Pain Point Variability with Demographics

In this section, we try a separate approach in first identifying the top 10 pain points out of the 106 metrics, and then predict with only the demographic data.

Let us start with picking the top X pain points first.

As the results and interpretation can quite differ for different approaches, we try to first filter for all the metrics which has a mean importance value above 3.5 (still arbitrarily set). In this way, we make sure that the resulting metrics are important to the customers, but are unsatisfied with the current state (thus the pain point is large). Otherwise, we would risk in having metrics coming out on top which are not of concern to the costumers.

Here are the resulting metrics, sorted my their mean importance

mean_imp <- IMP_SAT_pain_points %>% 
   group_by(Metric_Nr) %>% 
   summarize(mean_importance = mean(IMP)) %>% 
   filter(mean_importance >= 3.5) %>% 
   arrange(desc(mean_importance))

mean_imp
## # A tibble: 78 × 2
##    Metric_Nr mean_importance
##    <chr>               <dbl>
##  1 81                   4.12
##  2 97                   4.04
##  3 52                   4.02
##  4 30                   3.99
##  5 40                   3.97
##  6 46                   3.93
##  7 48                   3.91
##  8 85                   3.91
##  9 50                   3.91
## 10 84                   3.90
## # ℹ 68 more rows

We can now filter for these specific metrics, and calculate the mean pain point (that is, the mean difference between importance and satisfaction rating) and plot it. But first let us give those pain points their names back so we know what these pain points are.

metric_labels <- read.xlsx("Hybridmobilität systemized.xlsx", sheet =3, startRow =1,colNames = F)

colnames(metric_labels) <- c("Metric_Nr", "Label")

metrics_to_take <- mean_imp %>% pull(Metric_Nr) %>% as.numeric()

top_15_Pain_points <- IMP_SAT_pain_points %>% 
   filter(Metric_Nr %in% metrics_to_take) %>% 
   group_by(Metric_Nr) %>% 
   summarize(`Mean Pain Points` = mean(Diff)) %>% 
   slice_max(order_by = `Mean Pain Points`, n = 15) %>% 
   mutate(Metric_Nr = as.numeric(Metric_Nr)) %>% 
   left_join(metric_labels, by = "Metric_Nr", keep = FALSE)

top_15_Pain_points$label <- str_wrap(top_15_Pain_points$Label, width = 60)
top_15_Pain_points$label <- reorder(top_15_Pain_points$label, X = top_15_Pain_points$`Mean Pain Points`)


top_15_Pain_points %>%
   ggplot(aes(label, y = `Mean Pain Points`))+
   geom_segment(aes(x = label, xend = label, y = 0, yend = `Mean Pain Points`))+
   geom_point(color = "#28C2B9", size = 4, alpha = .9)+
   theme_minimal()+
   labs(x = "Metric Number", title = "Top 15 Pain Points in Decreasing Order")+
   coord_flip()+
   theme(panel.grid.minor.y = element_blank(),
         panel.grid.major.y = element_blank(),
         axis.text.y = element_text(size = 12))+
   ylim(0,1)

According to this plot, the top 6 metrics have higher pain points, whereas the 7th to 11th and 12th to 15th metrics are one step lower.

DV_set <- IMP_SAT_pain_points %>% 
   filter(Metric_Nr %in% top_15_Pain_points$Metric_Nr)%>% 
   mutate(Metric_Nr = as.numeric(Metric_Nr)) %>%
   left_join(metric_labels, by = "Metric_Nr")

As we have 82 demographic and profiling variables to choose from for the regularized prediction, it is better to choose the variables a priori before tucking them into the model.

Though we are not quite sure how biased this approach is, one could view a correlation plot and just select those variables which have higher correlation with others, so we can make sure not to include redundant variables.

# demographics_PR_full <- demographics_full %>% 
#    full_join(PR_ratings, by = "ID")
# 
# corrplot(cor(demographics_PR_full))

corrplot(cor(PR_ratings))

From this plot above, we see that most of the correlations are in the profiling questions, but they are still small or not present. But each profiling question has its subgroup (seen by the smaller squares). We’ll choose from each profiling question the ones which has the “biggest” correlations, and we will give them their respective labels as well.

selected_PR <- PR_ratings %>% select(ID, PR12_3, PR13_1, PR13_6, PR14_7, PR15_1)

PR_name <- read.xlsx("Hybridmobilität systemized.xlsx", sheet = 2, startRow =1 ,colNames = T) %>% 
   select(names(selected_PR))

names(selected_PR) <- PR_name[1,]
names(selected_PR)
## [1] "ID"                                                                                                                    
## [2] "Aus ökologischen Gründen fühle ich mich verpflichtet, den PKW im Alltag möglichst oft stehen zu lassen."               
## [3] "Für mich ist es schwer, die Wege in meinem Alltag mit öffentlichen Verkehrsmitteln anstatt mit dem Auto zurückzulegen."
## [4] "Personen, die sehr viele ihrer Wege mit den öV zurücklegen, beeindrucken mich."                                        
## [5] "Mein fahrerisches Geschick beim Autofahren anwenden zu können, macht mir Spass."                                       
## [6] "Ich bin gerne mit dem Velo unterwegs."

For the demographics, we’ll pick the generic ones (like age, gender), and handpick those that seem likely to be predictive of the metrics.

selected_demog <- demographics %>% select(ID, Alter, Geschlecht, Wohnort, nutzung_oev, SBB_App, Google_Maps, neuen_ort, aw_zug, aw_tram, fz_zug) %>% 
   mutate(ID = as.numeric(ID))

Let’s put everything together!

prediction_df <- DV_set %>% 
   select(ID, Diff, Label) %>% 
   pivot_wider(names_from = Label, values_from = Diff) %>% 
   full_join(selected_demog, by = "ID") %>% 
   full_join(selected_PR, by = "ID")

DV <- prediction_df %>% select(top_15_Pain_points$Label)

IV <- prediction_df %>% select(Alter:ncol(prediction_df))

One last check whether the missing data is still present (maybe we had luck and picked variables without missing data):

vis_dat(IV, sort_type = F)

Sadly, the variables of interest have missing values. It seems like there is a pattern though… Let’s see whether we can identify why missingness occured in our dataset. Only the questions where the customer gets asked how they commute to work is missing. Maybe it has to do with their current occupation?

demographics %>% filter(is.na(aw_zug)) %>% 
   select(prozent_berufstätig, aw_zug, nutzung_oev)
## # A tibble: 9 × 3
## # Rowwise: 
##   prozent_berufstätig            aw_zug nutzung_oev                  
##   <fct>                          <fct>  <fct>                        
## 1 In Pension                     <NA>   Täglich                      
## 2 Hausfrau/-mann                 <NA>   Einmal im Monat oder seltener
## 3 Hausfrau/-mann                 <NA>   Mehrmals in der Woche        
## 4 In Pension                     <NA>   Einmal im Monat oder seltener
## 5 Studierende oder in Ausbildung <NA>   Mehrmals in der Woche        
## 6 Studierende oder in Ausbildung <NA>   Täglich                      
## 7 20% bis 80%                    <NA>   Einmal im Monat oder seltener
## 8 80% und mehr                   <NA>   Täglich                      
## 9 Hausfrau/-mann                 <NA>   Einmal im Monat oder seltener

It seems like these people are not in the work force right now, and thus have left the questions concerning “Arbeitsweg” unanswered. If I may, I’ll impute “Nie” in these variables so we still work with all the customers.

IV[is.na(IV)] <- "Nie"
vis_dat(IV, sort_type = F)

In the last step, we try and predict each dependent variable (in this case the pain points for each customer for the all the metrics) separately with the selected demographic and profiling data. For this, we’ll use regularized regression and try to use 10-fold cross validation so we make sure we get out-of sample performance and see whether our model is generalizable or not.

res_reg_cv_df <- data.frame()

set.seed(42)

IV_sc <- IV %>% 
   mutate_if(.predicate = is.numeric, .funs = scale)

#start of for loop
for (i in 1:ncol(DV)) {
   
cv <- cv.glmnet(x =  as.matrix(IV_sc), y = as.matrix(DV)[, i], 
                type.measure = "mse",
                nfolds = 10)

#take the best lambda, and fit a regularized regression
best_lambda <- cv$lambda.min

fin_model <- glmnet(
   x = IV_sc,
   y = as.matrix(DV)[, i],
   alpha = 1,
   lambda = best_lambda
)

#calculate rsq values
rsq_mean <- cv$glmnet.fit$dev.ratio %>% mean()
rsq_max <- cv$glmnet.fit$dev.ratio %>% max()
rsq_sd <- cv$glmnet.fit$dev.ratio %>% sd()

#sort coefficients by value
coefficients <- coef(fin_model)[, 1] %>% round(digits = 3) %>% abs() %>% sort(decreasing = TRUE)

#remove the intercept if it is on the list of top 3
pos_intercept <- which(names(coefficients) == "(Intercept)")
coefficients <- coefficients[-c(pos_intercept)]

relevant_coeff <- coefficients %>% head(3) %>% names()
coeff_val <-  coefficients %>% head(3)

summary_cv <- matrix(
   data = c(names(DV[, i]),
           best_lambda,
           rsq_mean,
           rsq_max,
           rsq_sd,
           relevant_coeff,
           coeff_val),
   byrow = TRUE,
   ncol = 11) %>% 
   as.data.frame()

res_reg_cv_df <- rbind(res_reg_cv_df, summary_cv)
}

names(res_reg_cv_df) <- c("Question",
                          "best_lambda",
                          "rsq_mean",
                          "rsq_max",
                          "rsq_sd",
                          paste0("relevant_coeff_", 1:3),
                          paste0("coeff_val_", 1:3)
)




# 
# train_ctr <- trainControl(method = "cv", number = 10)
# model <-
#    train(
#       x = IV, 
#       y = DV$`so früh wie möglich zu wissen, welche alternative Route man nehmen kann, wenn eine Störung oder Verspätung eintritt`,
#       method = "glmnet",
#       metric = "RMSE",
#       preProcess = c("center", "scale"),
#       trControl = train_ctr
#    )
# 
# print(model)
# 
# model$results$Rsquared
# fin_model <- model$finalModel
# coefficient <- coef(fin_model)
# print(coefficient)

We can now clean up our resulting dataframe, where we have the relevant out of sample metrics, as well as the weights of the top 3 variables for each of our pain points.

res_reg_cv_df$Label <- str_wrap(res_reg_cv_df$Question, width = 60)
res_reg_cv_df$Label <- reorder(res_reg_cv_df$Label, top_15_Pain_points$`Mean Pain Points`)


res_reg_cv_df <- res_reg_cv_df %>% 
   mutate_at(.vars = vars(contains("lambda") | contains("rsq") | contains("val")),
             .funs = as.numeric)

res_reg_cv_df %>% 
   ggplot(aes(Label, rsq_mean)) +
   geom_col()+
   coord_flip()+
   labs(y = TeX("Mean\\ $R^2$"),
        x = "Pain Point")+
   geom_errorbar(aes(ymin = rsq_mean- rsq_sd, ymax = rsq_mean + rsq_sd), width = .2)+
   theme_bw()+
   theme(axis.text.y = element_text(size = 15),
         axis.text.x = element_text(size = 14),
         axis.title.y = element_text(size = 18),
         axis.title.x = element_text(size = 18))+
   scale_y_continuous(breaks = seq(0, 0.15, by = 0.02), limits = c(0, 0.15))

Which of the variables had higher importance in predicting the pain points?

coeff_val_df <- res_reg_cv_df %>% 
   select(Question, starts_with("coeff")) %>% 
   pivot_longer(cols = starts_with("coeff"),
                names_prefix = "coeff_val_",
               names_to = "coeff_Nr",
               values_to = "coeff_val"
                  )

coeff_label_df <- res_reg_cv_df %>% 
   select(Question, starts_with("relevant")) %>% 
   pivot_longer(cols = starts_with("relevant"),
                names_prefix = "relevant_coeff_",
                names_to = "coeff_Nr",
                values_to = "variable")

coeff_df <- coeff_val_df %>% 
   full_join(coeff_label_df, by = c("Question", "coeff_Nr"))

coeff_df_filtered <- coeff_df %>% filter(coeff_val >=.1) %>% 
   mutate(
      variable2 = case_when(
         variable == "Für mich ist es schwer, die Wege in meinem Alltag mit öffentlichen Verkehrsmitteln anstatt mit dem Auto zurückzulegen."  ~ "Schwierigkeiten ÖV statt Auto zu benutzen",
         variable == "Mein fahrerisches Geschick beim Autofahren anwenden zu können, macht mir Spass." ~ "Spass am Autofahren",
         .default = variable
      )
   )

So we can’t predict whether it is a pain point or not…. only 2.7% - 10% can be explained.

Here is our final plot:

labels_for_lollipop <-  coeff_df_filtered %>% 
   group_by(Question) %>% 
   summarize(coeff = paste(variable2, collapse = ", "))

df_for_lollipop <- top_15_Pain_points %>%
   left_join(labels_for_lollipop, by = c("Label" = "Question")) %>% 
   mutate(coeff2 = ifelse(is.na(coeff), " ",
                          str_wrap(
                             str_replace(coeff, pattern = "\\.", replacement = " "),
                                width = 50
                             )
                          ))



df_for_lollipop %>% 
   ggplot(aes(x = label, y = `Mean Pain Points`))+
   geom_segment(aes(x = label, xend = label, y = 0, yend = `Mean Pain Points`))+
   geom_point(color = "#28C2B9", size = 4, alpha = .9)+
   theme_minimal()+
   labs(x = "Pain Point")+
   coord_flip()+
   theme(panel.grid.minor.y = element_blank(),
         panel.grid.major.y = element_blank(),
         axis.text.y = element_text(size = 15),
         axis.text.x = element_text(size = 14),
         axis.title.x = element_text(size = 18),
         axis.title.y = element_text(size = 18))+
   scale_y_continuous(breaks = seq(0, 1.3, by = .2), limits = c(0,1.3))+
   #ylim(c(0,1.3))+
   geom_text(aes(y = `Mean Pain Points`, label = coeff2), hjust = 0, nudge_y = .1, size = 5)